Skip to content

fix(subscriber): propagate handler error to engage startSubscriber backoff#3218

Open
NitinKumar004 wants to merge 2 commits into
gofr-dev:developmentfrom
NitinKumar004:fix/subscriber-error-swallowed
Open

fix(subscriber): propagate handler error to engage startSubscriber backoff#3218
NitinKumar004 wants to merge 2 commits into
gofr-dev:developmentfrom
NitinKumar004:fix/subscriber-error-swallowed

Conversation

@NitinKumar004
Copy link
Copy Markdown
Contributor

@NitinKumar004 NitinKumar004 commented Mar 27, 2026

Summary

handleSubscription logged handler errors with Errorf and returned nil. Because the error never reached startSubscriber, its err != nil branch — which sets delay = 2 * time.Second — never engaged, so a failing handler caused a tight resubscribe loop on every consecutive failure.

This PR:

Rebased on top of #3424.

Note on commit/redelivery behaviour: this is not a "data loss" fix — the handler-error path never reached msg.Commit() before either, so the broker already redelivers. The observable symptom this fixes is the tight loop, not message loss.

Fixes #3215

Test plan

@NitinKumar004 NitinKumar004 force-pushed the fix/subscriber-error-swallowed branch from 9f5b862 to 6a6435d Compare March 27, 2026 12:05
Copy link
Copy Markdown
Member

@Umang01-hash Umang01-hash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Previously:
{"level":"DEBUG","message":{"messageValue":"{\"orderId\":\"ORD-001\",...}","topic":"order-logs",...}}                                                                                           
  {"level":"INFO", "message":"handler received order #1: {OrderId:ORD-001 Status:FAILED}"}                                                                                                        
  {"level":"ERROR","message":"error in handler for topic order-logs: handler: downstream DB unavailable"}                                                                                         
                                               ↑ ONE error log, then immediately next message:                                                                                                    
  {"level":"DEBUG","message":{"messageValue":"{\"orderId\":\"ORD-BEFORE-FIX\",...}","topic":"order-logs",...}}                                                                                    
  {"level":"INFO", "message":"handler received order #2: {...}"}                                                                                                                                  
  {"level":"ERROR","message":"error in handler for topic order-logs: handler: downstream DB unavailable"}                                                                                         
                                               ↑ NO DELAY — tight loop!      

After fix (return err — PR change):

  {"level":"DEBUG","message":{"messageValue":"{\"orderId\":\"ORD-001\",...}","topic":"order-logs",...}}                                                                                           
  {"level":"INFO", "message":"handler received order #1: {OrderId:ORD-001 Status:FAILED}"}                                                                                                      
  {"level":"ERROR","message":"error in handler for topic order-logs: handler: downstream DB unavailable"}                                                                                         
  {"level":"ERROR","message":"error in subscription for topic order-logs: handler: downstream DB unavailable"}
                                 ↑ TWO error logs for the same error — from handleSubscription AND startSubscriber                                                                                
                                 ↑ Then 2-second delay before next message is attempted                                                                                                           
  {"level":"DEBUG","message":{"messageValue":"{\"orderId\":\"ORD-BEFORE-FIX\",...}","topic":"order-logs",...}}                                                                                    
                                               ↑ Next message 2 seconds later ✅ 

Every handler error fires both "error in handler for topic..." and "error in subscription for topic...". In a busy subscription loop under real failure conditions this doubles alert noise.

Maybe we can try a small fix: remove the Errorf on subscriber.go:70 and let startSubscriber's existing log (line 38) be the single error entry point.
That keeps error propagation clean and logs the error exactly once.

@NitinKumar004 NitinKumar004 force-pushed the fix/subscriber-error-swallowed branch from fa8abeb to 4e980a6 Compare April 11, 2026 19:00
@Umang01-hash
Copy link
Copy Markdown
Member

Conflicts with #3424 (merged) — needs rebase. After rebase the err != nil branch should keep the existing Errorf and return err (so startSubscriber's 2s backoff engages); leave the panic sentinel path returning nil.

Minor:

  • "Data loss" framing is off — messages aren't committed on handler error today, broker redelivers. Real symptom is a tight loop because handleSubscription returns nil. Worth updating the title/description.
  • subscriber_test.go will conflict too — fix(subscriber): emit error on handler panic recover to prevent commit of not processed messages #3424 already added errHandlerFail; reuse it. TestHandleSubscription_SubscribeError isn't testing this PR's change (mock already returned that error) — drop or fold in.
  • Follow-up: startSubscriber sets delay=2s on failure but never resets to 0 on success.

…ckoff

Previously, when a handler returned a non-panic error, handleSubscription
logged it and returned nil. That kept startSubscriber's loop tight: it
never tripped the err != nil branch, so the 2s backoff never engaged
and the loop hot-spun.

This change keeps the existing error log and returns the error so
startSubscriber backs off for 2s before re-subscribing. The recovered-
panic sentinel (errSubscriberHandlerPanic, added in gofr-dev#3424) still
returns nil so panicRecovery's log isn't duplicated.

Also reset the loop's delay to 0 after a successful iteration — without
that, a single failure pinned subsequent successful iterations at the
2s cadence forever.

Note: this does not change commit semantics. The handler-error path
never reached msg.Commit() before; the broker still redelivers.
@NitinKumar004 NitinKumar004 force-pushed the fix/subscriber-error-swallowed branch from 702b6c1 to 567d369 Compare May 14, 2026 12:27
@NitinKumar004 NitinKumar004 changed the title Fix: Subscriber handler errors silently swallowed causing potential data loss fix(subscriber): propagate handler error to engage startSubscriber backoff May 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fix: Subscriber handler errors silently swallowed causing potential data loss

2 participants